-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ZNS support #1298
ZNS support #1298
Conversation
56b6aa8
to
cf95cde
Compare
Rebased this PR onto develop |
Hi @MaisenbacherD - I believe that SPDK v23.01 is now integrated into Mayastor. |
@tiagolobocastro can you confirm that SPDK v23.01 is now the current default base SPDK version in the product? |
Hi @MaisenbacherD I just chatted with @tiagolobocastro and he clarified that we will be updating to the latest release of SPDK soon, which is v24.01. and that we are already at v23.01 as of now. |
Hi @orville-wright Let me know if there is something else I can do. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haven't finished, will resume tomorrow
io-engine-tests/src/lib.rs
Outdated
nr_seq_zones: u32, | ||
max_active_zones: u32, | ||
max_open_zones: u32, | ||
) -> Result<u32, (i32, String)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we return a new type Wrapper with RAII instead? example:
) -> Result<u32, (i32, String)> { | |
) -> Result<NullBlk, (i32, String)> { |
struct NullBlk(u32);
impl Drop for NullBlk {
fn drop(&mut self) {
delete_nullblk_device(self.0);
}
}
@@ -541,3 +542,92 @@ macro_rules! test_diag { | |||
println!($($arg)*); | |||
}} | |||
} | |||
|
|||
pub fn create_zoned_nullblk_device( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add some doc comments explaining what this nullblk is, for example, does it just ignore writes and return zeroes?
@@ -103,6 +103,7 @@ url = "2.2.2" | |||
gettid = "0.1.2" | |||
async-process = { version = "1.5.0" } | |||
rstack = { version = "0.3.2" } | |||
jemalloc-sys = "0.5.2+5.3.0-patched" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this used for? A patched version makes wonder if this is stable enough..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for mentioning this!
I needed a way to allocate memory for the spdk_bdev_zone_info
c struct. I am using calloc
and free
of this allocator crate.
Can you suggest a nicer way to do this? :)
@hrudaya21 Unfortunately, there is no convenient way to allocate this struct within SPDK.
io-engine/src/bdev/device.rs
Outdated
SPDK_BDEV_ZONE_STATE_OFFLINE => Ok(SPDK_NVME_ZONE_STATE_OFFLINE), | ||
SPDK_BDEV_ZONE_STATE_EXP_OPEN => Ok(SPDK_NVME_ZONE_STATE_EOPEN), | ||
_ => { | ||
error!("Can not map SPDK_BDEV_ZONE_STATE {} to any SPDK_NVME_ZONE_STATE", bdev_zone_state); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error!("Can not map SPDK_BDEV_ZONE_STATE {} to any SPDK_NVME_ZONE_STATE", bdev_zone_state); | |
error!("Can't map SPDK_BDEV_ZONE_STATE {bdev_zone_state} to any SPDK_NVME_ZONE_STATE"); |
io-engine/src/bdev/device.rs
Outdated
error!("Can not map SPDK_BDEV_ZONE_STATE {} to any SPDK_NVME_ZONE_STATE", bdev_zone_state); | ||
Err(CoreError::NvmeIoPassthruDispatch { | ||
source: Errno::EINVAL, | ||
opcode: 122, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does 122 mean? and 121 below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stands for Zone Management Receive and Zone Management Send respectively. I encoded those values into an enum.
io-engine/src/core/block_device.rs
Outdated
@@ -82,6 +85,14 @@ pub trait BlockDevice { | |||
&self, | |||
listener: DeviceEventSink, | |||
) -> Result<(), CoreError>; | |||
|
|||
fn is_zoned(&self) -> bool; | |||
fn get_zone_size(&self) -> u64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: in rust getters, don't tend to have get_ prefix
io-engine/src/bdev/nvme.rs
Outdated
bdev.remove_alias(self.url.as_ref()); | ||
let errno = unsafe { | ||
bdev_nvme_delete( | ||
self.name.clone().into_cstring().as_ptr(), | ||
std::ptr::null(), | ||
&mut path_id as *mut nvme_path_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs to be rebased
io-engine/src/bdev/nexus/nexus_io.rs
Outdated
@@ -253,10 +266,20 @@ impl<'n> NexusBio<'n> { | |||
if status == IoCompletionStatus::Success { | |||
self.ctx_mut().successful += 1; | |||
} else { | |||
error!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already have traces below? I'd prefer not to add too many traces in the IO path until we add a redirection. Currently traces are a blocking call so if stdout/err is not pulled fast enough it can block the task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, it makes sense. I will remove this line. :)
|
||
self.completion_error(child, status); | ||
// Don't take zoned child out on zoned related nvme errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't we want to take it out in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Zoned storage related status codes like TOO_MAY_OPEN_ZONES
or ZONE_IS_FULL
might be handled by the consuming application that is aware of the zoned storage interface.
Because no zoned replication is enabled yet, taking out the nexus is a bit harsh.
With zoned replication support later on we might want to take the nexus out again, depending on the logic of the replication.
) -> Result<Vec<(String, String)>, Error> { | ||
let mut children_devices = Vec::new(); | ||
for uri in children { | ||
let device_name = device_create(uri).await.unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to handle errors here rather than unwrap
@tiagolobocastro Thanks so far! I will address your comments soon after you are done :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsavitskiy @dsharma-dc @hrudaya21 please take a look as well
max_zone_append_size: u32, | ||
max_open_zones: u32, | ||
max_active_zones: u32, | ||
optimal_open_zones: u32, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to have some doc comments here describing what these mean for the folks who are not very familiar with zones, such as myself :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add doc comments. Struct moves to spdk-rs/src/bdev.rs
as I refactored the with_zoned
bdev builder function to take the ZoneInfo
and spdk-rs should not depend on mayastor.
impl Default for ZoneInfo { | ||
fn default() -> ZoneInfo{ | ||
ZoneInfo{ | ||
zoned: false, | ||
num_zones: Default::default(), | ||
zone_size: Default::default(), | ||
max_zone_append_size: Default::default(), | ||
max_open_zones: Default::default(), | ||
max_active_zones: Default::default(), | ||
optimal_open_zones: Default::default(), | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simply add Default to the derive above
impl Default for ZoneInfo { | |
fn default() -> ZoneInfo{ | |
ZoneInfo{ | |
zoned: false, | |
num_zones: Default::default(), | |
zone_size: Default::default(), | |
max_zone_append_size: Default::default(), | |
max_open_zones: Default::default(), | |
max_active_zones: Default::default(), | |
optimal_open_zones: Default::default(), | |
} | |
} | |
} |
@@ -387,6 +418,13 @@ impl<'n> Nexus<'n> { | |||
.with_block_count(0) | |||
.with_required_alignment(9) | |||
.with_data(n) | |||
.with_zoned(zone_info.zoned) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: would be tidier to add a with_zone method which takes the zone_info rather than adding all these here 1by1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree! Will change this :)
@@ -404,6 +442,7 @@ impl<'n> Nexus<'n> { | |||
name.to_string(), | |||
n.bdev.as_mut().unwrap(), | |||
)); | |||
n.is_zoned = zone_info.zoned; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this not be set in the build()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove the is_zoned
field as it was unused. Zoned information can be retrieved via the bdev.
let mut info_set = false; | ||
|
||
for (_uri, device_name) in &*children_devices { | ||
let dev = device_lookup(&device_name).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should return an error
let mut zone_info = ZoneInfo::default(); | ||
let mut conventional = false; | ||
let mut info_set = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there another name for non-zoned blk devices btw? (rather than conventional)
let mut zone_info = ZoneInfo::default(); | |
let mut conventional = false; | |
let mut info_set = false; | |
// if we find non-zoned block devices. | |
let mut found_conventional = false; | |
let mut zone_info: Option<ZoneInfo> = None; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, non-zoned blk devs are called "conventional". In the NVMe context, one could also label such a device as an NVM device (vs ZNS for zoned block devices), which refers to the I/O command set.
|
||
if dev.is_zoned() { | ||
zone_info.zoned = true; | ||
if !info_set { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if let Some(zone_info) = zone_info.as_mut() or match
} | ||
|
||
if zone_info.zoned == conventional { | ||
error!("Can not handle conventional and zoned storage at the same time in a nexus"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe include the nexus in the log too?
error!("Can not handle conventional and zoned storage at the same time in a nexus"); | |
error!("{nexus:?} - can not handle conventional and zoned storage at the same time in a nexus"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nexus does not exist at this point, but I can include the name of the requested nexus.
} | ||
|
||
unsafe { | ||
free(bdev_zone_info_c_void); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be done with RAII?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most likely yes. Let me wait on that with the discussion above about the jemalloc-sys crate. :)
|
||
let bdev_zone_info_c_void = unsafe { calloc(1, size_of_spdk_bdev_zone_info) }; | ||
loop { | ||
if zone >= max_num_zones_to_report { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar question I guess, should this be async?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree. Let me carry this change out when the discussion of the jamalloc-sys crate is resolved :)
io-engine/src/core/env.rs
Outdated
@@ -814,6 +817,29 @@ impl MayastorEnvironment { | |||
None | |||
} | |||
|
|||
fn init_tracing(&self) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
init_spdk_tracing
perhaps.
for uri in children { | ||
let device_name = device_create(uri).await.unwrap(); | ||
if device_lookup(&device_name).unwrap().is_zoned() && children.len() > 1 { | ||
return Err(Error::ZonedReplicationNotImplemented {}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are returning without creating all devices, don't we want to destroy back what we would've already created?
|
||
// Bit 07:00 Dword 13 > Zone Send Action | ||
let zsa = zone_send_action_to_bdev_zone_action(nvme_cmd.cdw13 as u8).unwrap(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid unwrap() here.
io-engine/src/bdev/device.rs
Outdated
let bdev_zone_infos; | ||
|
||
let ret = unsafe { | ||
bdev_zone_infos = calloc(max_num_zones_to_report as usize, size_of_spdk_bdev_zone_info); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to have this allcation inside the SPDK function, so that, if any failure happens, We can have a graceful exit here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, there is no convenient way to allocate this struct within SPDK.
Can you suggest a nicer way to allocate memory for this struct? :)
if dev.is_zoned() { | ||
//TODO: Implement partitioning zoned block devices. This requires handling drive resources like max active/open zones. | ||
warn!("The device '{}' is zoned. Partitioning zoned block devices into smaller devices is not implemented. Using the whole device.", dev.device_name()); | ||
start_blk = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not work correctly because we have to reserve some space in the child device for metadata. That space calculation is done by calc_data_partition
and start/end blocks are hence adjusted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this! I was not aware that a metadata partition is necessary. I assumed that partitioning is just a feature that allows the creation of multiple Nexus from the same drive.
What kind of metadata needs space on the drive and where in the code can I find the IO requests being issued for those metadata operations? :)
Also, I am not sure if I fully understand what is happening in the match partition::calc_data_partition(self.req_size(), nb, bs) {
code block. Does that mean the first self.req_size()
bytes are reserved for metadata?
@@ -82,6 +85,14 @@ pub trait BlockDevice { | |||
&self, | |||
listener: DeviceEventSink, | |||
) -> Result<(), CoreError>; | |||
|
|||
fn is_zoned(&self) -> bool; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've the same suggestion.
@@ -62,6 +62,9 @@ pub trait BlockDevice { | |||
/// Checks whether target I/O type is supported by the device. | |||
fn io_type_supported(&self, io_type: IoType) -> bool; | |||
|
|||
/// Checks whether target I/O type is supported by the device. | |||
fn io_type_supported_by_device(&self, io_type: IoType) -> bool; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this different from io_type_supported()? Please describe the difference in the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great question!
The io_type_supported_by_device
still checks the supported io types of the drive.
For the ZNS feature, I had to implement NVMe IO passthrough support (see commit "feat(nexus): Add NVMe IO passthrough support") to enable Zone Management Send and Receive commands.
The io_type_supported
returns always true for NvmeIo commands, because also zoned SpdkBlockDevices
need to support the NvmeIo commands that are being issued to the zoned Nexus NVMe-oF target. Because SpdkBlockDevices
do not natively support NvmeIo commands, the commit "feat(nexus): Emulate zone mgmt NvmeIo for SpdkBlockDevices" emulates this command.
The submit_io_passthru
function takes the NvmeIo
commands and either passes it on to the device if that command is natively supported or it checks if the NvmeIo.opc
is emulated in software. Finally, the NvmeIo command gets rejected (errno::EOPNOTSUPP
) if no emulation for that opcode is present.
I will adjust the docs of io_type_supported
Let me know if you have further questions or ideas on how to refactor this. :)
@@ -82,6 +85,14 @@ pub trait BlockDevice { | |||
&self, | |||
listener: DeviceEventSink, | |||
) -> Result<(), CoreError>; | |||
|
|||
fn is_zoned(&self) -> bool; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add meaningful comments for all new methods, ideally with args and return value descriptions.
io-engine/src/core/block_device.rs
Outdated
@@ -212,7 +223,51 @@ pub trait BlockDeviceHandle { | |||
cb_arg: IoCompletionCallbackArg, | |||
) -> Result<(), CoreError>; | |||
|
|||
/// TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add meaningful comments for all new methods, ideally with args and return value descriptions.
@@ -199,7 +199,7 @@ impl Default for MayastorCliArgs { | |||
reactor_mask: "0x1".into(), | |||
mem_size: 0, | |||
rpc_address: "/var/tmp/mayastor.sock".to_string(), | |||
no_pci: true, | |||
no_pci: false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, why this change is needed? Is it only for ZNS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change would just be a 'nice to have'. It does not only target ZNS.
As part of the "feat(test): Enable PCIe device testing" commit, this enables testing on real PCIe devices. The idea is to later write some test cases that can run in a CI environment on real devices.
Let me know if I should drop this change :)
@@ -517,6 +556,10 @@ impl<'n> Nexus<'n> { | |||
unsafe { self.bdev().required_alignment() } | |||
} | |||
|
|||
pub fn is_zoned(&self) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a doc comment for all new fields, methods, etc.
if dev.is_zoned() { | ||
zone_info.zoned = true; | ||
if !info_set { | ||
zone_info.num_zones = dev.get_num_zones(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like adding a couple of helper methods to copy and compare zone info would make code much cleaner.
io-engine/src/bdev/nexus/nexus_io.rs
Outdated
| IoType::Reset | ||
| IoType::Unmap | ||
| IoType::Flush => self.submit_all(), | ||
IoType::ZoneAppend => { | ||
warn!("ZoneAppend is explicitly disallowed, otherwise reading from different replicas won't work."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add the nexus info in the warn! message.
io-engine/src/core/mod.rs
Outdated
pub fn is_zoned_nvme_error(status: IoCompletionStatus) -> bool { | ||
match status { | ||
IoCompletionStatus::NvmeError(NvmeStatus::CommandSpecific(cssc)) => match cssc { | ||
CommandSpecificStatusCode::ZonedBoundaryError => true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pipe | operator would make this code clearer.
@@ -350,6 +350,23 @@ impl From<NvmeStatus> for IoCompletionStatus { | |||
} | |||
} | |||
|
|||
pub fn is_zoned_nvme_error(status: IoCompletionStatus) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a doc comment for all new fields, methods, etc.
zone_info.max_active_zones = dev.get_max_active_zones(); | ||
zone_info.optimal_open_zones = dev.get_optimal_open_zones(); | ||
info_set = true; | ||
} else if zone_info.num_zones != dev.get_num_zones() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better implement or derive PartialEq
trait for ZoneInfo
let mut children_devices = Vec::new(); | ||
for uri in children { | ||
let device_name = device_create(uri).await.unwrap(); | ||
if device_lookup(&device_name).unwrap().is_zoned() && children.len() > 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid unwrap() here as well.
@@ -1371,6 +1480,14 @@ async fn nexus_create_internal( | |||
return Ok(()); | |||
} | |||
|
|||
let mut children_devices = create_children_devices(children).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the normal child (not zoned device), whether this function will fail ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the normal case, the create_children_devices
function is only failing if the device_create
function returns an error.
hi @MaisenbacherD it looks like all the requested reviewers have given significant feedback on the PR. So you should have lots of guidance on moving this project forward now. Let me know of you need any more assistance. Happy to help. |
Thanks everyone for the reviews! I am addressing your comments in the following days :) |
io-engine/src/bdev/nexus/nexus_io.rs
Outdated
} else { | ||
match orig_nvme_cmd.opc() { | ||
// Zone Management Send | ||
121 => return hdl.emulate_zone_mgmt_send_io_passthru( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest to have opcode enum.
@@ -7,7 +7,7 @@ use futures::{future::Future, FutureExt}; | |||
|
|||
mod nexus_bdev; | |||
mod nexus_bdev_children; | |||
mod nexus_bdev_error; | |||
pub mod nexus_bdev_error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be pub(crate) and maybe at other places?
io-engine/src/bdev/nvmx/device.rs
Outdated
self.ns.is_zoned() | ||
} | ||
|
||
fn get_zone_size(&self) -> u64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to use get_
prefix?
|
||
stdenv.mkDerivation rec { | ||
pname = "btrfs-progs"; | ||
version = "5.14.1"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not pin the version? Also we use 5.6.x
for btrfs operations on csi, so maybe it should be at least be in par with that.
hi @MaisenbacherD - checking in here on where things are on the project?
|
Move children device creation out from `Nexus::new_child` into a helper function. Create the children devices before the Nexus is created and add them after the Nexus's initialization. This is in preperation for zoned storge support, as we need to inspect the children device features before creating a Nexus such that we can decide whether the Nexus must be zoned or not. Signed-off-by: Dennis Maisenbacher <[email protected]>
Adding setter and getter for zone storage related fields to the Nexus and BlockDevice. Inspect the children before creating the Nexus to decide whether the Nexus needs to be zoned or not. The Nexus inherrits the zoned related fields form its child. For now a zoned Nexus does not allow replication. Partitioning of a zoned Nexus is also not supported for now. The user will optain the whole zoned Nexus if requested. Signed-off-by: Dennis Maisenbacher <[email protected]>
Passthrough NvmeIo if the child (NVMe device) supports it. This is necessary for the Zoned Storage support, as the Zone Management Send and Receive commands are issued to the Nexus as NvmeIo. Those commands are issued when zones are reported or the state of a zone is actively changed. Signed-off-by: Dennis Maisenbacher <[email protected]>
Zoned block devices that do not support NvmeIo (zoned uring) need to get the zone management send and receive NvmeIo translated that is beeing issued on the Nexus. Those commands are issued when zones are reported or the state of a zone is actively changed. Inspect fields of the incoming NVMe command and fill the buffer according to the 'NVM Express Zoned Namespace Command Set Specification' (see https://nvmexpress.org/developers/nvme-command-set-specifications/) with the help of the SPDK zoned bdev wrapper functions. Signed-off-by: Dennis Maisenbacher <[email protected]>
Testing zoned storage support through zoned uring with a backing zoned nullblk device. Signed-off-by: Dennis Maisenbacher <[email protected]>
Signed-off-by: Dennis Maisenbacher <[email protected]>
Signed-off-by: Dennis Maisenbacher <[email protected]>
Zoned related CommandSpecificStatusCodes's should not take out the nexus. Signed-off-by: Dennis Maisenbacher <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, everyone for the helpful reviews, and my apologies for the late reply!
I hope that I addressed all comments with either a discussion or a direct code change.
There are some more open discussions to resolve :)
Also, I will address the btrfs nix package comments and post a fix for the Failed to write to /dev/nvme-fabrics: Invalid argument
on a more recent kernel when I am back from vacation in two weeks. :)
Please also have a look at the PR for spdk-rs that is required by this series: openebs/spdk-rs#24
@@ -199,7 +199,7 @@ impl Default for MayastorCliArgs { | |||
reactor_mask: "0x1".into(), | |||
mem_size: 0, | |||
rpc_address: "/var/tmp/mayastor.sock".to_string(), | |||
no_pci: true, | |||
no_pci: false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change would just be a 'nice to have'. It does not only target ZNS.
As part of the "feat(test): Enable PCIe device testing" commit, this enables testing on real PCIe devices. The idea is to later write some test cases that can run in a CI environment on real devices.
Let me know if I should drop this change :)
@@ -1371,6 +1480,14 @@ async fn nexus_create_internal( | |||
return Ok(()); | |||
} | |||
|
|||
let mut children_devices = create_children_devices(children).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the normal case, the create_children_devices
function is only failing if the device_create
function returns an error.
@@ -83,7 +174,7 @@ impl SpdkBlockDevice { | |||
pub fn lookup_by_name(name: &str) -> Option<Box<dyn BlockDevice>> { | |||
debug!("Searching SPDK devices for '{}'...", name); | |||
let bdev = UntypedBdev::lookup_by_name(name)?; | |||
debug!("SPDK {} device found: '{}'", bdev.driver(), name); | |||
debug!("SPDK {} device found: '{}'", bdev.driver(), bdev.name()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be no difference between name
and bdev.name()
.
This change stuck from debugging and was meant to clarify that the name of the found bdev
is supposed to be printed out and match with the debug statement above.
I can revert this change if you want :)
@@ -387,6 +418,13 @@ impl<'n> Nexus<'n> { | |||
.with_block_count(0) | |||
.with_required_alignment(9) | |||
.with_data(n) | |||
.with_zoned(zone_info.zoned) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree! Will change this :)
@@ -253,6 +256,7 @@ pub struct Nexus<'n> { | |||
pub(super) nexus_target: Option<NexusTarget>, | |||
/// Indicates if the Nexus has an I/O device. | |||
pub(super) has_io_device: bool, | |||
pub(super) is_zoned: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove this field as it was unused. Zoned information can be retrieved via the bdev.
} | ||
|
||
unsafe { | ||
free(bdev_zone_info_c_void); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most likely yes. Let me wait on that with the discussion above about the jemalloc-sys crate. :)
|
||
let bdev_zone_info_c_void = unsafe { calloc(1, size_of_spdk_bdev_zone_info) }; | ||
loop { | ||
if zone >= max_num_zones_to_report { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree. Let me carry this change out when the discussion of the jamalloc-sys crate is resolved :)
|
||
let mut result; | ||
loop { | ||
result = unsafe { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree. Let me carry this change out when the discussion of the jamalloc-sys crate is resolved :)
io-engine/src/bdev/nexus/nexus_io.rs
Outdated
@@ -253,10 +266,20 @@ impl<'n> NexusBio<'n> { | |||
if status == IoCompletionStatus::Success { | |||
self.ctx_mut().successful += 1; | |||
} else { | |||
error!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, it makes sense. I will remove this line. :)
|
||
self.completion_error(child, status); | ||
// Don't take zoned child out on zoned related nvme errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Zoned storage related status codes like TOO_MAY_OPEN_ZONES
or ZONE_IS_FULL
might be handled by the consuming application that is aware of the zoned storage interface.
Because no zoned replication is enabled yet, taking out the nexus is a bit harsh.
With zoned replication support later on we might want to take the nexus out again, depending on the logic of the replication.
cf95cde
to
344c3fb
Compare
Thanks @MaisenbacherD - We also appreciate that an industry respected storage innovator (WD.com) supports openEBS with their resources & your time/expertise. Thank you WD.com The team will review and pass on comments. Great progress. |
Hi @orville-wright, Apart from the review comments, would it be possible to get some help with enabling the latest btrfs, libnvme and nvme-cli nix packages? I am not that fluent in nix yet. :) |
Sure thing. I'll loop get you connected with the right people on the team. |
Hey @tiagolobocastro @niladrih and @Abhinandan-Purkait - Dennis @MaisenbacherD needs some eng help with some of the packages... specifically btrfs, libnvme and nvme-cli nix.
~Dave |
Sure. FYI, btrfs support has already been added to mayastor, We just need to specify |
Hi @MaisenbacherD looks like some feedback was given by @Abhinandan-Purkait , but I'm not sure if he provided enough of the info you needed ? |
@orville-wright, @Abhinandan-Purkait great! I was not aware that btrfs is already an option :) The current libnvme and nvme-cli nix packages in use look outdated. I would need help to get the latest nix upstream packages configured. I am not sure that adjusting e.g. |
To be honest, you can probably just remove the specific files we've got for those, and just rely on the versions from nixpkgs. |
@MaisenbacherD sorry for short notice, would you be able to join our community meeting today: https://us05web.zoom.us/j/87535654586?pwd=CigbXigJPn38USc6Vuzt7qSVFoO79X.1 |
Hi @tiagolobocastro. Yes, I will join :) Talk to you soon! |
Should we close this PR for now? I guess we forgot to add an update here @MaisenbacherD It's been a while, so I hope I'm not misremembering it @MaisenbacherD? (btw besides the nexus changes, there are some code changes here which would be interesting to take as it adds value by itself) |
@tiagolobocastro Thanks for remembering to post our discussions here :) I am closing this PR draft now. Also, I will make sure to bring the ZNS-unrelated commits of this PR to the CSAL patchset :) Thanks, everyone for the in-depth discussions on this! |
This draft PR addresses the issue #1297 and shall encourage discussion.
This should not be merged as is as there are two dependencies that need to be resolved before this can potentially be merged: